Abstract

We consider the problem of community detection from observed interactions between
individuals, in the context where multiple types of interaction are possible. We use labelled
stochastic block models to represent the observed data, where labels correspond to
interaction types. Focusing on a two-community scenario, we conjecture a threshold for the
problem of reconstructing the hidden communities in a way that is correlated with the true
partition. To substantiate the conjecture, we prove that the given threshold correctly
identifies a transition in the behaviour of belief propagation from insensitive to sensitive. We
further prove that the same threshold corresponds to the transition in a related inference
problem on a tree model from infeasible to feasible. Finally, numerical results using belief
propagation for community detection give further support to the conjecture.



1 Introduction 

Community detection consists in the identification of underlying clusters of individuals with 
similar properties in an overall population. It is relevant in vastly diverse contexts such as 
biology and sociology, where one might want to classify proteins or humans respectively, based 
on their interactions. Most methods assume interactions to be described by a graph, whose 
edges represent pairs of individuals known to interact. They then amount to graph clustering, 
with potentially distinct flavours: assortative communities see more interactions within than 
across communities, while the opposite holds in the disassortative case. 

The stochastic block model provides a versatile model of community structure, allowing 
representation of diverse scenarios and analytical comparison of candidate algorithmic detection 
procedures. In this model, nodes are partitioned into blocks, and an edge is present between 
any two nodes with a probability depending only on the blocks to which each of the two nodes 
belong. Despite its simplicity, this model already displays rich behaviours, some of which are 
not yet fully understood. One phenomenon of practical interest consists in a phase transition 
from a situation where the graph of interactions does not reveal any structure, to one where 
it reflects some of the underlying structure. In the latter case, algorithmic procedures such as 
belief propagation can perform non-trivial classifications of nodes. 

The simplest example of this situation consists in a model with n nodes partitioned into two 
equal-size blocks, and where two nodes are connected with probability a/n or b/n depending 
on whether they belong to the same block or not. It is then known that the condition

$(a - b)^2 > 2(a + b)$    (1)

is necessary for reconstruction, i.e. for clustering in a way that is correlated with the true
partition. Mossel et al. [1] have indeed shown that, if it is violated, then the distribution of the observed graph
is absolutely continuous with respect to that of an unstructured fully symmetric random graph 
without underlying block structure. When this condition holds, it is conjectured by Decelle et 
al. [2] that the underlying block structure can at least partially be recovered by belief propa- 
gation. Beyond their theoretical interest, such threshold phenomena also have some practical 
implications: they indicate what amount of downsampling or perturbation of original data can 
be tolerated before all useful information is lost. 

Three elements support the conjecture that under Condition (1) community detection is
possible. First, Decelle et al. [2] show that it implies sensitivity of belief propagation to noise.
Second, it is known to correspond to a certain reconstruction threshold for a model of infinite
random trees, whose structure locally resembles that of the stochastic block model. Third,
numerical evaluations indicate the ability of belief propagation to retrieve some of the underlying
structure under (1).

In the present work, we initiate an investigation of similar phenomena in the more general 
context of labelled stochastic block models. In such models the observation of an interaction be- 
tween any two individuals is enriched with a label which represents that interaction's particular 
type. Many applications of community detection naturally feature such labels. Protein-protein 
chemical reactions may be exothermic or endothermic; (movie-user) associations in collabora- 
tive filtering typically come with user ratings; email exchanges may be cold, formal, or familiar; 
etc. 

Our main contribution consists in a generalization of Condition (1), which describes the transition
from unidentifiable to identifiable, to the context of labelled stochastic block models.
Specifically, after introducing necessary notation and our main conjecture in Section 3, we show in
Section 4 that our generalized condition corresponds to the transition from insensitivity to
sensitivity in belief propagation. We then show in Section 5 that it also coincides with the
reconstruction threshold for the corresponding labelled tree model. The conjecture is further
validated numerically in Section 6, where belief propagation is shown to achieve useful detection
only above the threshold. Conclusions are drawn in Section 7.

2 Related Work 

Several works address community detection in the un-labelled stochastic block model. The 
two main approaches are based on belief propagation and spectral methods. Spectral methods 
typically ensure consistent reconstruction in regimes with high average degree. An early
reference is McSherry [3]. More recently, Rohe et al. [4] use Laplacian spectra and address
growing numbers of communities, but still require high ($\omega(1)$) connectivity. Decelle et al. [2]
rely on belief propagation, and heuristically determine a threshold for detectability in a "sparse" 
regime, where node degrees are of order 1. 

The related problem of tree reconstruction has initially been considered by Evans et al. 
[5], who identified a threshold on the tree's mean degree above which reconstruction is feasible 
through a simple "census" method. This threshold was later shown to correctly identify the 
onset of "robust reconstruction" by Janson and Mossel [6]. We refer to [7] for a survey of this
area. 

A complete understanding of the relation between thresholds for community detectability 
in block models and reconstruction in associated tree models is still missing. See, however,
Gerschenfeld and Montanari [8] for conditions under which the two thresholds coincide. For
the symmetric two-community case, Mossel et al. [1] show that the threshold for community
detectability is at least as large as that for tree reconstruction; Coja-Oghlan [9] determines an 
upper bound on the threshold for community detection, that is believed to be loose. 

In contrast, to the best of our knowledge, the problems of community detection and tree
reconstruction in the labelled case have not been explicitly considered in the literature.



3 Model description and main conjecture 

In the sequel we focus on the simplest non-trivial labelled stochastic block model, which is
defined as follows. A total of $n$ nodes are split into two equal-size blocks, namely block 0 and
block 1. The type of any given node $i \in \{1, \ldots, n\}$ refers to the block it belongs to, and is
denoted by $\sigma_i \in \{0, 1\}$. Any two nodes $i, j$ are related with probability $a/n$ if $\sigma_i = \sigma_j$, and
with probability $b/n$ otherwise, where $a, b$ are two positive constants. Furthermore, given any
two related nodes $i, j$, one observes a label $L_{ij}$ taking its values in some finite set $\mathcal{L}$. Label $L_{ij}$
is drawn from distribution $\{\mu(\ell)\}_{\ell \in \mathcal{L}}$ if $\sigma_i = \sigma_j$ and from distribution $\{\nu(\ell)\}_{\ell \in \mathcal{L}}$ otherwise.
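
The generative description above can be sketched in a few lines of Python. This is only an illustrative sampler under stated assumptions: the function name `sample_labelled_sbm`, the dictionary encoding of the label distributions and the quadratic loop over pairs are choices made here for readability, not part of the original text.

```python
import random

def sample_labelled_sbm(n, a, b, mu, nu, seed=0):
    """Sample node types and labelled edges of a two-block labelled SBM.

    mu, nu: dicts mapping each label to its probability (hypothetical encoding).
    Returns (sigma, edges), where edges maps a pair (i, j) with i < j to a label.
    """
    rng = random.Random(seed)
    # Two equal-size blocks, assigned to nodes in random order.
    sigma = [0] * (n // 2) + [1] * (n - n // 2)
    rng.shuffle(sigma)
    labels_mu, weights_mu = zip(*mu.items())
    labels_nu, weights_nu = zip(*nu.items())
    edges = {}
    for i in range(n):
        for j in range(i + 1, n):
            same = sigma[i] == sigma[j]
            # Nodes are related with probability a/n (same block) or b/n (different blocks).
            if rng.random() < (a if same else b) / n:
                if same:
                    edges[(i, j)] = rng.choices(labels_mu, weights_mu)[0]
                else:
                    edges[(i, j)] = rng.choices(labels_nu, weights_nu)[0]
    return sigma, edges
```

The pairwise loop costs O(n^2) time, which is acceptable for small illustrations; a sparse sampler would be preferable for large n.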

Note that the present model generalizes the one studied in Mossel et al. [1], to which it
reduces when the labels do not bring extra information relative to the types of the underlying
nodes, that is when $\mu(\ell) = \nu(\ell)$. In this context, we make the following conjecture:

Conjecture: In the labelled stochastic block model with two symmetric blocks, connectivity
parameters $a, b > 0$ and label distributions $\mu$, $\nu$, reconstruction is infeasible if $\tau < 1$, while it is
feasible when $\tau > 1$, where the threshold value $\tau$ is defined as

$\tau := \lambda \sum_{\ell \in \mathcal{L}} \frac{a\mu(\ell) + b\nu(\ell)}{a + b} \left( \frac{a\mu(\ell) - b\nu(\ell)}{a\mu(\ell) + b\nu(\ell)} \right)^2$,    (2)

and $\lambda := (a + b)/2$ is the mean degree in the corresponding block model.

Note that this extends the conjecture made for the un-labelled case in [1], as the Condition
$\tau > 1$ simplifies to (1) when $\mu(\ell) = \nu(\ell)$. We will now establish several results supporting this
conjecture.
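
As a hedged illustration, the threshold of expression (2) can be evaluated numerically as $\tau = \lambda \sum_\ell \frac{a\mu(\ell)+b\nu(\ell)}{a+b} \big(\frac{a\mu(\ell)-b\nu(\ell)}{a\mu(\ell)+b\nu(\ell)}\big)^2$ with $\lambda = (a+b)/2$. The helper name `threshold_tau` and the dictionary encoding of the label distributions are assumptions made for this sketch.

```python
def threshold_tau(a, b, mu, nu):
    """Evaluate the conjectured threshold tau of expression (2).

    mu, nu: dicts mapping each label to its probability (hypothetical encoding).
    """
    lam = (a + b) / 2.0  # mean degree
    total = 0.0
    for label in set(mu) | set(nu):
        m = a * mu.get(label, 0.0)
        v = b * nu.get(label, 0.0)
        if m + v == 0.0:
            continue  # label never observed, contributes nothing
        total += (m + v) / (a + b) * ((m - v) / (m + v)) ** 2
    return lam * total
```

With a single uninformative label the function returns $(a-b)^2 / (2(a+b))$, so that `threshold_tau(...) > 1` reduces to Condition (1).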



4 Phase transition for belief propagation sensitivity 

We first introduce a labelled tree which can be coupled with the original graph, see Proposition
5.2 in [1] (the only difference here is the addition of labels on edges). Consider the following
random tree version of the reconstruction problem. Starting from a root node $r$ with type
$\sigma_r \in \{0, 1\}$, consider a branching process with the following characteristics. Each node $i$ with
type $\sigma_i$ gives birth to a number of children of type $t = \sigma_i$ with Poisson distribution $\mathrm{Poi}(a/2)$
and to a number of children of type $t = 1 - \sigma_i$ with Poisson distribution $\mathrm{Poi}(b/2)$. Conditional
on the types $(t, t')$ of a (parent-child) pair, a label is attached to the edge, drawn
independently of everything else with distribution $\mu$ if $t = t'$, and with distribution $\nu$ if $t \neq t'$.

Consider now such a tree up to depth $d$, which we denote by $T_d$. For each node $i \in T_d$, denote
by $T_d(i)$ the subtree rooted at node $i$, together with its labels. Let $X_i = P(\sigma_i = 1 \mid T_d(i))$, and

$R_i = \frac{X_i}{1 - X_i}$.

Bayes formula entails that

$R_i = \prod_{j \text{ child of } i} \frac{X_j a\mu(L_{ij}) + (1 - X_j) b\nu(L_{ij})}{X_j b\nu(L_{ij}) + (1 - X_j) a\mu(L_{ij})}$.

This readily reduces to a recursion in terms of the random variables $R_j$:

$R_i = \prod_{j \text{ child of } i} \frac{R_j a\mu(L_{ij}) + b\nu(L_{ij})}{R_j b\nu(L_{ij}) + a\mu(L_{ij})}$.

It also follows at once from these expressions that if one starts from uniform beliefs ($X = 1/2$
or equivalently $R = 1$ on the leaves), then uniform beliefs constitute a fixed point.
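
The recursion on belief ratios, $R_i = \prod_{j} \frac{R_j a\mu(L_{ij}) + b\nu(L_{ij})}{R_j b\nu(L_{ij}) + a\mu(L_{ij})}$ over the children $j$ of $i$, can be sketched as a short recursive function. The tree encoding used here (a node is either a number, read as a belief ratio imposed at the boundary, or a list of `(label, child)` pairs) and the name `belief_ratio` are assumptions made for illustration only.

```python
def belief_ratio(node, a, b, mu, nu):
    """Belief ratio R at `node` computed bottom-up.

    node: either a number (belief ratio imposed at a boundary node) or a
          list of (label, child) pairs (hypothetical encoding of the tree).
    mu, nu: dicts mapping each label to its probability.
    """
    if isinstance(node, (int, float)):
        return float(node)
    ratio = 1.0  # empty product for a node without children
    for label, child in node:
        r_child = belief_ratio(child, a, b, mu, nu)
        m, v = a * mu[label], b * nu[label]
        ratio *= (r_child * m + v) / (r_child * v + m)
    return ratio
```

Feeding ratios equal to 1 at every boundary node returns 1 at the root, which is the fixed point property stated above.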

Following Decelle et al. [2], we introduce the following notion of robustness to noise for this
fixed point:

Definition 1. Assume that belief ratios $R$ for leaf nodes at depth less than $d$ are fixed to 1.
The belief ratio $R_r$ at root $r$ is then determined by induction from the belief ratios $R_j$ of nodes
at depth $d$, i.e. $j \in \partial T_d$, through a map $F_d$: $R_r = F_d(R_j, j \in \partial T_d)$.

The infinitesimal sensitivity $\chi(d)$ of the root belief $R_r$ to noise at depth $d$ is defined as

$\chi(d) = \lim_{\epsilon \to 0} \frac{1}{\epsilon^2} \mathrm{Var}\left( F_d(1 + \epsilon \xi_j, j \in \partial T_d) \mid T_d \right)$,    (3)

where the $\xi_j$ are i.i.d. unit variance random variables. The fixed point $R = 1$ is then said to be
insensitive to noise if $\lim_{d \to \infty} \chi(d) = 0$, and sensitive to noise if $\lim_{d \to \infty} \chi(d) = +\infty$.

With these definitions at hand, we are ready to state the following 

Theorem 1. Let $\tau$ be defined by expression (2). Then the fixed point $R = 1$ is insensitive to
noise if $\tau < 1$ and sensitive to noise if $\tau > 1$.

Before we prove the theorem, let us comment on the implications. As conjectured in Decelle
et al. [2] in the case of un-labelled data, community detection is infeasible in an instance which is
insensitive to noise, while it is feasible (i.e. some reconstruction classifies correctly more than
half the nodes) in an instance that is sensitive to noise. This leads us to state the conjecture in
Section 3.

Before proving Theorem 1 we need a technical result. Consider thus a branching process
with Poisson offspring distribution with mean $\lambda$ for some $\lambda > 1$. In addition, each parent-child
edge in the corresponding branching tree is endowed with a real weight. All weights $W$ are
sampled in an i.i.d. fashion with moment generating function $\varphi(\theta) = E[e^{\theta W}] < \infty$.

We let $N(d)$ denote the number of descendants in the $d$-th generation. We further let
$N^+(d, s)$ (resp. $N^-(d, s)$) denote the number of such descendants whose sum of weights along
the path from the ancestor to them is larger (resp. smaller) than $ds$.

Let us now introduce the so-called rate function $h$ as follows. First, we let

$h_0(x) := \sup_{y \in \mathbb{R}} \left( xy - \log \varphi(y) \right)$.

This is the so-called Cramér transform of the weights distribution, which by Cramér's theorem
determines the behaviour of large deviations of empirical means $(1/d) \sum_{t=1}^{d} W_t$ of i.i.d. weights
from their expectation $\bar{w} := \varphi'(0)$. Let now $w^-$ and $w^+$ be defined as

$w^+ = \inf\{x > \bar{w} : h_0(x) > \log \lambda\}$,
$w^- = \sup\{x < \bar{w} : h_0(x) > \log \lambda\}$.

We then let

$h(x) := h_0(x)$ if $x \in [w^-, w^+]$, and $h(x) := +\infty$ otherwise.    (4)
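
For a weight distribution with finite support, the Cramér transform $h_0(x) = \sup_y (xy - \log \varphi(y))$ can be approximated by a plain grid search over $y$. The sketch below is an assumption-laden illustration: the function names, the search window `y_max` and the grid size are arbitrary choices, and the approximation is only meaningful when the maximising $y$ lies inside the window.

```python
import math

def log_mgf(y, weights, probs):
    """log E[exp(y W)] for a finite weight distribution (log-sum-exp for stability)."""
    terms = [math.log(p) + y * w for w, p in zip(weights, probs) if p > 0]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def cramer_transform(x, weights, probs, y_max=20.0, steps=4001):
    """Grid approximation of h_0(x) = sup_y (x*y - log phi(y)) over |y| <= y_max."""
    best = -math.inf
    for k in range(steps):
        y = -y_max + 2.0 * y_max * k / (steps - 1)
        best = max(best, x * y - log_mgf(y, weights, probs))
    return best
```

For weights taking the values 0 and 1 with probability 1/2 each, the closed form is $h_0(x) = x \log x + (1-x)\log(1-x) + \log 2$ on $(0, 1)$, which gives a convenient check of the approximation.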

We are now ready to state the following 

Theorem 2. For any $x > \bar{w}$, $x \neq w^+$, on the event that the branching process survives
indefinitely, one has the almost sure convergence

$\lim_{d \to \infty} \left( N^+(d, x) \right)^{1/d} = \lambda e^{-h(x)}$.    (5)

Similarly, for all $x < \bar{w}$, $x \neq w^-$, on the event that the branching process survives indefinitely,
one has

$\lim_{d \to \infty} \left( N^-(d, x) \right)^{1/d} = \lambda e^{-h(x)}$.    (6)
Proof. We only prove (5), as the other property (6) is shown similarly. Consider first the case
where $x > w^+$. The expectation of the count $N^+(d, x)$ in (5) reads

$E N^+(d, x) = \lambda^d P\left( \sum_{t=1}^{d} W_t \ge xd \right)$.

Chernoff's bound implies that this is no larger than $e^{d(\log \lambda - h_0(x))}$. Being an integer-valued
random variable, $N^+(d, x)$ is then positive only with probability at most $e^{d(\log \lambda - h_0(x))}$, by
Markov's inequality. Since $h_0(x) > \log \lambda$ for $x > w^+$, these probabilities are summable in $d$.
By the Borel-Cantelli lemma, $N^+(d, x)$ is then positive only for finitely many $d$'s. Thus the limit in (5)
is 0, as announced.

The case where $x \in [\bar{w}, w^+)$ follows from a general result for branching random walks [10].
Indeed consider the random measure on $\mathbb{R}$:

$Z^{(d)} = \sum_{i=1}^{N(d)} \delta_{X_i}$,

where $X_i$ is the sum of the weights along the path from the ancestor to the $i$-th individual in
generation $d$. Note that we have $N^+(d, x) = Z^{(d)}[xd, \infty)$.
It is well-known that

$M^{(d)}(x) := (\lambda \varphi(x))^{-d} \int e^{xy} Z^{(d)}(dy)$

is a positive martingale and hence has an almost sure limit $M(x)$ as $d$ tends to infinity. For
$x \in (w^-, w^+)$, as shown in [10], the limit $M(x)$ is strictly positive if the process survives. Then
Theorem 4 in [10] implies that for any fixed $h > 0$, as $d$ tends to infinity:

$\left( Z^{(d)}[xd - h, xd + h] \right)^{1/d} \to \lambda e^{-h_0(x)}$.

This clearly gives a lower bound to (5). The upper bound is easily obtained by the following
argument:

$Z^{(d)}[xd, \infty) \le \int e^{\theta(y - xd)} Z^{(d)}(dy) = e^{-\theta xd} M^{(d)}(\theta) \lambda^d \varphi(\theta)^d$,

minimizing over $\theta < w^+$ (which ensures that $\lim_{d \to +\infty} M^{(d)}(\theta) = M(\theta) > 0$) gives the desired
result. □

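Let us now prove Theorem 1. We first determine an expression for the infinitesimal
sensitivity $\chi(d)$. Using linearization, we have that

$\chi(d) = \sum_{j \in \partial T_d} \prod_{(uv) \in \mathrm{path}(j \leadsto r)} \left( \frac{\partial}{\partial R} \frac{R a\mu(L_{uv}) + b\nu(L_{uv})}{a\mu(L_{uv}) + R b\nu(L_{uv})} \Big|_{R=1} \right)^2$.

The derivative in the above formula reads

$\frac{\partial}{\partial R} \frac{R a\mu(L_{uv}) + b\nu(L_{uv})}{a\mu(L_{uv}) + R b\nu(L_{uv})} \Big|_{R=1} = \frac{a\mu(L_{uv}) - b\nu(L_{uv})}{a\mu(L_{uv}) + b\nu(L_{uv})}$.

Let us denote the absolute value of this expression by $e^{W_{uv}}$ for some suitably defined weight
$W_{uv}$, so that

$\chi(d) = \sum_{j \in \partial T_d} \exp\left( 2 \sum_{(uv) \in \mathrm{path}(r \leadsto j)} W_{uv} \right)$.
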
Note that in the present model, thanks to symmetry between the two classes 0, 1, the labels
$L_{uv}$ are i.i.d., with probability distribution $P(L = \ell) = \frac{a\mu(\ell) + b\nu(\ell)}{a + b}$.

We are thus in the setup of Theorem 2, with a distribution for the weights suitably derived
from this label distribution and the transform $W = \log\left( |a\mu(L) - b\nu(L)| / (a\mu(L) + b\nu(L)) \right)$.
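
As a check on the linearization step, the derivative and the second moment of $e^{W}$ can be worked out explicitly. Writing $\alpha = a\mu(L_{uv})$ and $\beta = b\nu(L_{uv})$ (shorthand introduced here only for brevity), the quotient rule and the label distribution above give:

```latex
\frac{\partial}{\partial R}\,\frac{R\alpha+\beta}{\alpha+R\beta}
  = \frac{\alpha(\alpha+R\beta)-\beta(R\alpha+\beta)}{(\alpha+R\beta)^{2}}
  = \frac{\alpha^{2}-\beta^{2}}{(\alpha+R\beta)^{2}},
\qquad\text{which equals}\quad \frac{\alpha-\beta}{\alpha+\beta}\quad\text{at } R=1,

\mathbb{E}\,e^{2W}
  = \sum_{\ell\in\mathcal{L}} \frac{a\mu(\ell)+b\nu(\ell)}{a+b}
    \left(\frac{a\mu(\ell)-b\nu(\ell)}{a\mu(\ell)+b\nu(\ell)}\right)^{2}
  = \frac{\tau}{\lambda}.
```

In particular, the threshold of expression (2) satisfies $\log \tau = \log \lambda + \log \mathbb{E}\, e^{2W}$, where $\lambda = (a+b)/2$.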

We then have, from Theorem 2, applying the Laplace method, the exponential equivalent:

$\frac{1}{d} \log \chi(d) \sim \log \lambda + \sup_x \left( 2x - h(x) \right)$.    (7)

Consider the modified expression $\sup_x (2x - h_0(x))$, and let $x^*$ denote the point attaining this
supremum. By convexity of $h_0$ and the fact that it achieves its minimum at $\bar{w}$, necessarily
$x^* > \bar{w}$. This supremum equals $\log E e^{2W}$ by convex duality. Note also that $x^* < 0$, since the
support of the distribution of $W$ is in $\mathbb{R}^-$. Consider first the case where $\tau > 1$, or equivalently,

$\log \lambda + \log E e^{2W} > 0$.

We then have $h_0(x^*) = 2x^* - \log E e^{2W} < \log \lambda$ by the above condition, so that $h_0(x^*) = h(x^*)$.
Thus the logarithmic equivalent (7) reads $\log(\tau)$ and is strictly positive. We thus have sensitivity
to perturbations.

Consider next the case where $\tau < 1$, i.e. $\log E e^{2W} < -\log \lambda$. In that case, the
logarithmic equivalent (7) is upper-bounded by $\log(\tau)$ and is thus strictly negative. Insensitivity to
perturbations follows.



5 Phase transition for reconstructability on labelled trees 

In this section, $T$ is an infinite tree with types $\sigma \in \{0, 1\}$ on its vertices and labels $L$ on its
edges. To keep notation consistent with the previous section, a child has the same type as its
parent with probability $a/(a + b)$. Given that the child has the same type as its parent, its label is
distributed as $\mu(\ell)$, otherwise it is distributed according to $\nu(\ell)$. Note that if $T$ is a realization
of a Galton-Watson tree with offspring distribution $\mathrm{Poi}\left(\frac{a+b}{2}\right)$ conditioned on non-extinction,
we get exactly the same tree model as in the previous section. In this section, the underlying
tree is fixed (i.e. non-random) so that the only randomness considered here is associated with
the types of the vertices and the labels of the edges.

We denote by $P_0$ and $E_0$ the probability distribution and expectation conditional on the
labels of the edges of the tree. We define the function $\varepsilon : \mathcal{L} \to [0, 1/2]$ by

$\varepsilon(\ell) = \frac{b\nu(\ell)}{a\mu(\ell) + b\nu(\ell)}$.

If $j$ is a child of $i$, we have

$P_0(\sigma_i \neq \sigma_j) = \varepsilon(L_{ij})$.

We now give an alternative description of the random types of the vertices of the tree when 
the labels of the edges are known, i.e. conditionally on the labels. At the root r of the tree 
T a binary random variable is chosen uniformly at random. This type is then propagated, 
with error, throughout the tree as follows: the child j of the vertex i receives the type of i 
with probability $1 - \varepsilon(L_{ij})$, and the opposite type with probability $\varepsilon(L_{ij})$. These events at
the vertices are statistically independent. This model has been studied in information theory, 
mathematical genetics and statistical physics when the function e is constant. We refer to [5] 
for references. 
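
The propagation of types with label-dependent errors can be sketched as follows. This is a minimal illustration under assumptions: the tree is passed as a hypothetical `children` dictionary, edge labels as a dictionary keyed by (parent, child), and the flip probabilities as a dictionary `eps` supplied by the caller.

```python
import random

def broadcast_types(children, labels, eps, root=0, seed=0):
    """Propagate a uniformly random root type down a labelled tree.

    children: dict mapping a node to the list of its children
    labels:   dict mapping (parent, child) to the label of that edge
    eps:      dict mapping a label to the probability that the type flips
    """
    rng = random.Random(seed)
    sigma = {root: rng.randrange(2)}  # uniform binary type at the root
    stack = [root]
    while stack:
        i = stack.pop()
        for j in children.get(i, []):
            # Child j copies the type of i, flipped with probability eps(L_ij).
            flip = rng.random() < eps[labels[(i, j)]]
            sigma[j] = 1 - sigma[i] if flip else sigma[i]
            stack.append(j)
    return sigma
```

Each edge draws its flip independently, which matches the statement that the events at the vertices are statistically independent given the labels.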

Suppose we are given the types that arrived at the $d$-th level $\partial T_d$ of the tree $T$. Observing the
labels of the edges and using the optimal reconstruction strategy (maximum likelihood), the
probability of correctly reconstructing the original type at the root is denoted by $(1 + \Delta(T, d))/2$,
where clearly $\Delta(T, d) \ge 0$.
For an infinite tree $T$, we denote by $\lambda = \limsup_d |\partial T_d|^{1/d}$ its growth rate. Note that our
notation is consistent with the previous section, as in the case where $T$ is a realization of a
Galton-Watson tree with offspring distribution $\mathrm{Poi}\left(\frac{a+b}{2}\right)$, $\lambda = \frac{a+b}{2}$ a.s. We still define $\tau$ by the
expression (2). Adapting the argument of [5], we are able to show:

Theorem 3. Let $T$ be an infinite labelled tree with root $r$ as defined above. Consider the problem
of reconstructing the type of the root $\sigma_r$ from the types at the $d$-th level $\partial T_d$ of $T$ and the labels
on the tree.

1. If $\tau > 1$ then $\inf_{d \ge 1} \Delta(T, d) > 0$;

2. If $\tau < 1$ then $\inf_{d \ge 1} \Delta(T, d) = 0$.

Proof. Following [5], we derive a lower bound for $\Delta(T, d)$ in terms of the effective electrical
conductance from the root $r$ to $\partial T_d$ and an upper bound which is the maximum flow from $r$ to
$\partial T_d$ for certain edge capacities. We refer to [11] for background on these notions.

For the conductance lower bound, we follow Section 5 of [5] and for each edge $(i, j)$, $j$ a
child of $i$, we define $\theta_{ij} = 1 - 2\varepsilon(L_{ij}) = \frac{a\mu(L_{ij}) - b\nu(L_{ij})}{a\mu(L_{ij}) + b\nu(L_{ij})}$ and then assign the resistance

$R_{ij} = (1 - \theta_{ij}^2) \prod_{(uv) \in \mathrm{path}(r \leadsto j)} \theta_{uv}^{-2}$,

where $\mathrm{path}(r \leadsto j)$ is the path from the root $r$ to node $j$. We also define for each vertex $i$

$\Theta_i = \prod_{(uv) \in \mathrm{path}(r \leadsto i)} \theta_{uv}$.

By Theorem 1.2' and 1.3' of [5], we have

$\Delta(T, d) \ge \frac{1}{1 + \mathcal{R}_{\mathrm{eff}}(r \leftrightarrow \partial T_d)}$,    $\Delta(T, d)^2 \le 2 \sum_{i \in \partial T_d} \Theta_i^2$,

where $\mathcal{R}_{\mathrm{eff}}(r \leftrightarrow \partial T_d)$ is the effective resistance between the root $r$ and the $d$-th level of the tree.
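We first prove our second claim. Note that

$E[\theta_{uv}^2] = \sum_{\ell \in \mathcal{L}} \frac{a\mu(\ell) + b\nu(\ell)}{a + b} \left( \frac{a\mu(\ell) - b\nu(\ell)}{a\mu(\ell) + b\nu(\ell)} \right)^2 = \frac{\tau}{\lambda}$,

so that we have for $\tau < 1$,

$E\left[ \sum_{i \in \partial T_d} \Theta_i^2 \right] = |\partial T_d| \left( \frac{\tau}{\lambda} \right)^d \to 0$,

as $d$ tends to infinity. Hence by Fatou's lemma, we have

$\liminf_{d \to \infty} \sum_{i \in \partial T_d} \Theta_i^2 = 0$ a.s.,

and our second claim holds.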

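Our first claim will hold once we prove that for $\tau > 1$, we have $\mathcal{R}_{\mathrm{eff}}(r \leftrightarrow \infty) = \sup_{d \ge 1} \mathcal{R}_{\mathrm{eff}}(r \leftrightarrow \partial T_d) < \infty$.
This fact follows indeed from a computation done in [12]. Define the resistance
$R'_{ij} = \prod_{(uv) \in \mathrm{path}(r \leadsto j)} \theta_{uv}^{-2}$. Note that in our framework the labels of the edges are i.i.d. with
distribution $\frac{a\mu(\ell) + b\nu(\ell)}{a + b}$. In particular the random variables $\theta_{uv}$ are also i.i.d. and since $|\theta_{uv}| \le 1$,
we have $\min_{0 \le x \le 1} E[|\theta_{uv}|^{2x}] = E[\theta_{uv}^2]$ so that by Theorem 1(i) of [12], for $\tau > 1$, we have
$\mathcal{R}'_{\mathrm{eff}}(r \leftrightarrow \infty) < \infty$ a.s. Since $R_{uv} \le R'_{uv}$, we have by Rayleigh's monotonicity law (see [11]),
$\mathcal{R}_{\mathrm{eff}}(r \leftrightarrow \infty) \le \mathcal{R}'_{\mathrm{eff}}(r \leftrightarrow \infty) < \infty$ a.s.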

□ 
6 Numerical results




Figure 1: Overlap $Q$ as a function of the parameter $\varepsilon$ (left: $a = b$; right: $a < b$)

We now investigate numerically the validity of our proposed conjecture. We consider first a
labelled stochastic block model with two symmetric blocks where the connectivity parameters
are identical, i.e. $a = b$, so that community detection can only succeed based on the labels. We
assume for simplicity only two labels $+$ and $-$ and define the distributions $\mu(+) = p$ for edges
among nodes of the same type and $\nu(+) = q$ for edges between nodes of different type, for two
parameters $p, q \in [0, 1]$. In this case, Condition (1) does not hold, yet reconstruction may still
be feasible depending on the values of $p$ and $q$. In order to validate our conjecture that if the
value $\tau$ given in (2) is greater than 1, reconstruction may be feasible, we parametrize $p = \frac{1}{2} + \varepsilon$
and $q = \frac{1}{2} - \varepsilon$, which leads to the simplified condition for reconstruction:

$\varepsilon > \frac{1}{2\sqrt{a}}$.    (8)
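
A short derivation of this simplified condition, assuming expression (2) for $\tau$ with $\lambda = (a+b)/2$: when $a = b$, $\mu(+) = p = \frac{1}{2} + \varepsilon$ and $\nu(+) = q = \frac{1}{2} - \varepsilon$, both labels have probability $1/2$ and each contributes the same squared ratio.

```latex
\lambda = \frac{a+b}{2} = a, \qquad
\frac{a\mu(\pm)+b\nu(\pm)}{a+b} = \frac{1}{2}, \qquad
\left(\frac{a\mu(\pm)-b\nu(\pm)}{a\mu(\pm)+b\nu(\pm)}\right)^{2}
  = \left(\frac{p-q}{1}\right)^{2} = 4\varepsilon^{2},

\tau = a\left(\tfrac{1}{2}\cdot 4\varepsilon^{2} + \tfrac{1}{2}\cdot 4\varepsilon^{2}\right)
     = 4a\varepsilon^{2},
\qquad
\tau > 1 \iff \varepsilon > \frac{1}{2\sqrt{a}}.
```

For instance, $a = 4$ gives a critical value $\varepsilon = 1/4$, i.e. $p = 3/4$ and $q = 1/4$.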

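We characterize the success of the reconstruction using the overlap metric introduced by
Decelle et al. in equation (5) of [2], which we repeat below:

$Q(\{\sigma_i\}, \{\hat{\sigma}_i\}) = \max_{\pi} \frac{\frac{1}{n} \sum_i \delta_{\sigma_i, \pi(\hat{\sigma}_i)} - \max_t \frac{n_t}{n}}{1 - \max_t \frac{n_t}{n}}$,    (9)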

where $\sigma_i$ denotes the original assignment of types to nodes $i = 1 \ldots n$, $\hat{\sigma}_i$ denotes the estimated
assignment, $t$ denotes communities, and $n_t$ is the size of community $t$. In our setup, $t = 0$
or 1 and $n_t = n/2$. Since types may be assigned in different order in the estimate, we vary
over all permutations $\pi(\hat{\sigma}_i)$ of $\hat{\sigma}_i$ and take the one with maximum overlap. This overlap metric
ranges from 0 to 1, equating zero when classification is no better than assigning all nodes to a
fixed class (or equivalently, assigning nodes to a randomly chosen type). We generate a labelled
stochastic block model graph with the parameters given above and $n = 5000$ nodes. Then,
we use the standard sum-product belief propagation algorithm to infer the types of the nodes
based on the labels. We vary both the density, i.e. $a = b$, and $\varepsilon$. All plotted values are averages
over several different seeds.
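
For two communities, the maximisation over permutations in the overlap metric reduces to comparing the estimate with its complement. The sketch below assumes the normalised form of the overlap of Decelle et al. [2], in which community sizes enter as fractions of $n$; the function name and the list encoding of the assignments are choices made here for illustration, and both communities are assumed non-empty.

```python
def overlap_two_communities(sigma, sigma_hat):
    """Overlap Q between true types `sigma` and estimates `sigma_hat` (lists of 0/1)."""
    n = len(sigma)
    agree = sum(1 for s, t in zip(sigma, sigma_hat) if s == t)
    # With two types there are only two relabellings: identity and swap.
    best = max(agree, n - agree) / n
    # Fraction of nodes in the largest true community (1/2 for balanced blocks).
    largest = max(sigma.count(0), sigma.count(1)) / n
    return (best - largest) / (1.0 - largest)
```

A perfect estimate, up to swapping the two type names, gives an overlap of 1, while assigning every node to the same class gives 0 for balanced blocks.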

In Fig. 1 (left), we plot the overlap metric $Q$ against $\varepsilon$ on the x-axis for $a = b$ given by 2,
5, 10. For each curve, we indicate the threshold (8) as a vertical line in the same style as the
corresponding curve. We observe that to the left of the threshold, $Q$ remains around zero and
the variation may be attributed to the initial conditions and small-scale effects. To the right of
the threshold, however, $Q$ increases steadily.

For comparison, in Fig. 1 (right), we provide the same metric but with $a < b$ given by
$(a, b) = (1, 3), (4, 6), (8, 12)$. Accordingly, belief propagation can now exploit both edges as
well as their labels and the corresponding curves are shifted towards the left, along with the
threshold of $\varepsilon$ where $\tau = 1$, again indicated by a vertical line for each curve.

It is interesting that, even for reasonably small scales, belief propagation consistently fails
below the threshold, with overlap close to zero, yet achieves positive overlap above the threshold.
7 Concluding remarks



We have initiated an analysis of community detection in the context of labelled interactions. 
We have formulated a conjecture on when detectability is feasible, in the form of Condition
(2). While restricted to the two symmetric communities case, this condition is already useful in
determining how the availability of labels affects detectability. A natural extension will consider 
richer scenarios with more communities, where our techniques can potentially characterize the 
corresponding transition thresholds. On the theoretical front, we have established that two 
phase transitions, namely sensitivity of belief propagation, and tree reconstructability, coincide 
in the case of labelled trees. The main outstanding question there is to validate our conjecture 
that these thresholds characterize the onset of community detectability. 